{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regressão Linear Simples - Trabalho\n",
    "\n",
    "## Estudo de caso: Seguro de automóvel sueco\n",
    "\n",
    "Agora, sabemos como implementar um modelo de regressão linear simples. Vamos aplicá-lo ao conjunto de dados do seguro de automóveis sueco. Esta seção assume que você baixou o conjunto de dados para o arquivo insurance.csv, o qual está disponível no notebook respectivo.\n",
    "\n",
    "O conjunto de dados envolve a previsão do pagamento total de todas as reclamações em milhares de Kronor sueco, dado o número total de reclamações. É um dataset composto por 63 observações com 1 variável de entrada e 1 variável de saída. Os nomes das variáveis são os seguintes:\n",
    "\n",
    "1. Número de reivindicações.\n",
    "2. Pagamento total para todas as reclamações em milhares de Kronor sueco.\n",
    "\n",
    "Voce deve adicionar algumas funções acessórias à regressão linear simples. Especificamente, uma função para carregar o arquivo CSV chamado *load_csv ()*, uma função para converter um conjunto de dados carregado para números chamado *str_column_to_float ()*, uma função para avaliar um algoritmo usando um conjunto de treino e teste chamado *split_train_split ()*, a função para calcular RMSE chamado *rmse_metric ()* e uma função para avaliar um algoritmo chamado *evaluate_algorithm()*.\n",
    "\n",
    "Utilize um conjunto de dados de treinamento de 60% dos dados para preparar o modelo. As previsões devem ser feitas nos restantes 40%. \n",
    "\n",
    "Compare a performabce do seu algoritmo com o algoritmo baseline, o qual utiliza a média dos pagamentos realizados para realizar a predição ( a média é 72,251 mil Kronor).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<bound method NDFrame.head of     numrei  pgtTotal\n",
      "0      108     392.5\n",
      "1       19      46.2\n",
      "2       13      15.7\n",
      "3      124     422.2\n",
      "4       40     119.4\n",
      "5       57     170.9\n",
      "6       23      56.9\n",
      "7       14      77.5\n",
      "8       45     214.0\n",
      "9       10      65.3\n",
      "10       5      20.9\n",
      "11      48     248.1\n",
      "12      11      23.5\n",
      "13      23      39.6\n",
      "14       7      48.8\n",
      "15       2       6.6\n",
      "16      24     134.9\n",
      "17       6      50.9\n",
      "18       3       4.4\n",
      "19      23     113.0\n",
      "20       6      14.8\n",
      "21       9      48.7\n",
      "22       9      52.1\n",
      "23       3      13.2\n",
      "24      29     103.9\n",
      "25       7      77.5\n",
      "26       4      11.8\n",
      "27      20      98.1\n",
      "28       7      27.9\n",
      "29       4      38.1\n",
      "..     ...       ...\n",
      "33       5      40.3\n",
      "34      22     161.5\n",
      "35      11      57.2\n",
      "36      61     217.6\n",
      "37      12      58.1\n",
      "38       4      12.6\n",
      "39      16      59.6\n",
      "40      13      89.9\n",
      "41      60     202.4\n",
      "42      41     181.3\n",
      "43      37     152.8\n",
      "44      55     162.8\n",
      "45      41      73.4\n",
      "46      11      21.3\n",
      "47      27      92.6\n",
      "48       8      76.1\n",
      "49       3      39.9\n",
      "50      17     142.1\n",
      "51      13      93.0\n",
      "52      13      31.9\n",
      "53      15      32.1\n",
      "54       8      55.6\n",
      "55      29     133.3\n",
      "56      30     194.5\n",
      "57      24     137.9\n",
      "58       9      87.4\n",
      "59      31     209.8\n",
      "60      14      95.5\n",
      "61      53     244.6\n",
      "62      26     187.5\n",
      "\n",
      "[63 rows x 2 columns]>\n"
     ]
    }
   ],
   "source": [
    "columns = ['numrei','pgtTotal']\n",
    "dataset = pd.read_csv('insurance.csv' ,header = None, names = columns)\n",
    "print(dataset.head)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Calculate the mean value of a list of numbers\n",
    "def mean(values):\n",
    "  return sum(values) / float(len(values))\n",
    "\n",
    "# Calculate the variance of a list of numbers\n",
    "def variance(values, mean):\n",
    "  return sum([(x-mean)**2 for x in values])\n",
    "\n",
    "# Calculate covariance between x and y\n",
    "def covariance(x, mean_x, y, mean_y):\n",
    "  covar = 0.0\n",
    "  for i in range(len(x)):\n",
    "    covar += (x[i] - mean_x) * (y[i] - mean_y)\n",
    "  return covar\n",
    "\n",
    "# Calculate coefficients\n",
    "def coefficients(dataset):\n",
    "      x = [row[0] for row in dataset]\n",
    "      y = [row[1] for row in dataset]\n",
    "      x_mean, y_mean = mean(x), mean(y)\n",
    "      b1 = (covariance(x, x_mean, y, y_mean) / variance(x, x_mean))\n",
    "      b0 = y_mean - b1 * x_mean\n",
    "      return [b0, b1]\n",
    "\n",
    "def simple_linear_regression(train, test):\n",
    "  predictions = list()\n",
    "  b0, b1 = coefficients(train)\n",
    "  for row in test:\n",
    "    ypred = b0 + b1 * row[0]\n",
    "    predictions.append(ypred)\n",
    "  return predictions\n",
    "\n",
    "from math import sqrt\n",
    "\n",
    "# Calculate root mean squared error\n",
    "def rmse_metric(actual, predicted):\n",
    "  sum_error = 0.0\n",
    "  for i in range(len(actual)):\n",
    "    prediction_error = predicted[i] - actual[i]\n",
    "    sum_error += (prediction_error ** 2)\n",
    "  mean_error = sum_error / float(len(actual))\n",
    "  return sqrt(mean_error)\n",
    "\n",
    "# Evaluate regression algorithm on training dataset\n",
    "def evaluate_algorithm(dataset, algorithm):\n",
    "  test_set = list()\n",
    "  for row in dataset:\n",
    "    row_copy = list(row)\n",
    "    row_copy[-1] = None\n",
    "    test_set.append(row_copy)\n",
    "  predicted = algorithm(dataset, test_set)\n",
    "  print(predicted)\n",
    "  actual = [row[-1] for row in dataset]\n",
    "  rmse = rmse_metric(actual, predicted)\n",
    "  return rmse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dataset['numrei'] = pd.to_numeric(dataset['numrei'], errors='coerce').fillna(0)\n",
    "dataset['pgtTotal'] = pd.to_numeric(dataset['pgtTotal'], errors='coerce').fillna(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    numrei  pgtTotal\n",
      "0      108     392.5\n",
      "1       19      46.2\n",
      "2       13      15.7\n",
      "3      124     422.2\n",
      "4       40     119.4\n",
      "5       57     170.9\n",
      "6       23      56.9\n",
      "7       14      77.5\n",
      "8       45     214.0\n",
      "9       10      65.3\n",
      "10       5      20.9\n",
      "11      48     248.1\n",
      "12      11      23.5\n",
      "13      23      39.6\n",
      "14       7      48.8\n",
      "15       2       6.6\n",
      "16      24     134.9\n",
      "17       6      50.9\n",
      "18       3       4.4\n",
      "19      23     113.0\n",
      "20       6      14.8\n",
      "21       9      48.7\n",
      "22       9      52.1\n",
      "23       3      13.2\n",
      "24      29     103.9\n",
      "25       7      77.5\n",
      "26       4      11.8\n",
      "27      20      98.1\n",
      "28       7      27.9\n",
      "29       4      38.1\n",
      "..     ...       ...\n",
      "33       5      40.3\n",
      "34      22     161.5\n",
      "35      11      57.2\n",
      "36      61     217.6\n",
      "37      12      58.1\n",
      "38       4      12.6\n",
      "39      16      59.6\n",
      "40      13      89.9\n",
      "41      60     202.4\n",
      "42      41     181.3\n",
      "43      37     152.8\n",
      "44      55     162.8\n",
      "45      41      73.4\n",
      "46      11      21.3\n",
      "47      27      92.6\n",
      "48       8      76.1\n",
      "49       3      39.9\n",
      "50      17     142.1\n",
      "51      13      93.0\n",
      "52      13      31.9\n",
      "53      15      32.1\n",
      "54       8      55.6\n",
      "55      29     133.3\n",
      "56      30     194.5\n",
      "57      24     137.9\n",
      "58       9      87.4\n",
      "59      31     209.8\n",
      "60      14      95.5\n",
      "61      53     244.6\n",
      "62      26     187.5\n",
      "\n",
      "[63 rows x 2 columns]\n"
     ]
    }
   ],
   "source": [
    "print(dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numrei      22.904762\n",
       "pgtTotal    98.187302\n",
       "dtype: float64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.mean(numeric_only = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0     108\n",
      "1      19\n",
      "2      13\n",
      "3     124\n",
      "4      40\n",
      "5      57\n",
      "6      23\n",
      "7      14\n",
      "8      45\n",
      "9      10\n",
      "10      5\n",
      "11     48\n",
      "12     11\n",
      "13     23\n",
      "14      7\n",
      "15      2\n",
      "16     24\n",
      "17      6\n",
      "18      3\n",
      "19     23\n",
      "20      6\n",
      "21      9\n",
      "22      9\n",
      "23      3\n",
      "24     29\n",
      "25      7\n",
      "26      4\n",
      "27     20\n",
      "28      7\n",
      "29      4\n",
      "     ... \n",
      "33      5\n",
      "34     22\n",
      "35     11\n",
      "36     61\n",
      "37     12\n",
      "38      4\n",
      "39     16\n",
      "40     13\n",
      "41     60\n",
      "42     41\n",
      "43     37\n",
      "44     55\n",
      "45     41\n",
      "46     11\n",
      "47     27\n",
      "48      8\n",
      "49      3\n",
      "50     17\n",
      "51     13\n",
      "52     13\n",
      "53     15\n",
      "54      8\n",
      "55     29\n",
      "56     30\n",
      "57     24\n",
      "58      9\n",
      "59     31\n",
      "60     14\n",
      "61     53\n",
      "62     26\n",
      "Name: numrei, Length: 63, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "x = dataset['numrei']\n",
    "y = dataset['pgtTotal']\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "98.18730158730159"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean(x)\n",
    "mean(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=ShuffleSplit(n_splits=1, random_state=100, test_size=0.4, train_size=None),\n",
       "       error_score='raise',\n",
       "       estimator=DummyRegressor(constant=None, quantile=None, strategy='mean'),\n",
       "       fit_params=None, iid=True, n_jobs=1,\n",
       "       param_grid={'strategy': ['mean', 'median']},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,\n",
       "       scoring='neg_mean_squared_error', verbose=0)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.dummy import DummyRegressor\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.model_selection import ShuffleSplit\n",
    "\n",
    "dm = DummyRegressor()\n",
    "param_grid = {\"strategy\": [\"mean\", \"median\"]}\n",
    "ss = ShuffleSplit(n_splits=3, test_size=.4, random_state=100)\n",
    "\n",
    "cv = GridSearchCV(dm, cv=ss, param_grid=param_grid, scoring=\"neg_mean_squared_error\")\n",
    "cv.fit(dataset[['numrei']], dataset['pgtTotal'])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8787.5998097994034"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cv.best_score_ * -1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=ShuffleSplit(n_splits=1, random_state=100, test_size=0.4, train_size=None),\n",
       "       error_score='raise',\n",
       "       estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),\n",
       "       fit_params=None, iid=True, n_jobs=1, param_grid={},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,\n",
       "       scoring='neg_mean_squared_error', verbose=0)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "import math\n",
    "\n",
    "ln = LinearRegression()\n",
    "\n",
    "cv = GridSearchCV(ln, param_grid = {}, cv=ss, scoring=\"neg_mean_squared_error\")\n",
    "cv.fit(dataset[['numrei']], dataset['pgtTotal'])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1477.9873920103482"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cv.best_score_ *-1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
